Automatic detection and annotation of disfluencies in spoken French corpora

Authors

  • George Christodoulides
  • Mathieu Avanzi
Abstract

In this paper we propose a multi-step system for the semi-automatic detection and annotation of disfluencies in spoken corpora. A combination of rules, statistical models and machine learning techniques is applied to the input, which is a transcription aligned to the speech signal. The system uses the results of an automatic estimation of prosodic, part-of-speech and shallow syntactic features. We present a detailed coding scheme for simple disfluencies (filled pauses, mispronunciations, false starts, drawls and intra-word pauses), structured disfluencies (repetitions, deletions, substitutions, insertions) and complex disfluencies. The system is trained and evaluated on a transcribed corpus of spontaneous French speech from 112 speakers, balanced for speaker age and sex and covering 14 varieties of French spoken in Belgium, France and Switzerland.
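As a rough illustration of what a rule-based first pass over a tokenised transcription might look like, the sketch below marks filled pauses and immediate word repetitions, two of the categories named in the coding scheme. It is not the authors' system: the `FILLED_PAUSES` list, the `Annotation` class, the `annotate` function and the label strings are all hypothetical, and the prosodic, part-of-speech and syntactic features mentioned in the abstract are not used here.

```python
from dataclasses import dataclass

# Hypothetical lexicon of French filled pauses; the paper's own list is not given here.
FILLED_PAUSES = {"euh", "hum", "hm"}


@dataclass
class Annotation:
    start: int   # index of the first token in the span
    end: int     # index one past the last token in the span
    label: str   # hypothetical label, e.g. "filled_pause" or "repetition"


def annotate(tokens):
    """Toy rule-based pass: mark filled pauses and immediate word repetitions."""
    lowered = [t.lower() for t in tokens]
    annotations = []

    # Rule 1: any token found in the lexicon is a filled pause.
    for i, tok in enumerate(lowered):
        if tok in FILLED_PAUSES:
            annotations.append(Annotation(i, i + 1, "filled_pause"))

    # Rule 2: a run of identical adjacent tokens (not filled pauses) is a repetition.
    i = 0
    while i < len(lowered) - 1:
        j = i
        while (
            j + 1 < len(lowered)
            and lowered[j + 1] == lowered[i]
            and lowered[i] not in FILLED_PAUSES
        ):
            j += 1
        if j > i:
            annotations.append(Annotation(i, j + 1, "repetition"))
            i = j + 1
        else:
            i += 1

    return sorted(annotations, key=lambda a: (a.start, a.end))


if __name__ == "__main__":
    for ann in annotate("je je euh veux partir".split()):
        print(ann)
```

A pass of this kind only covers the surface cases; the structured and complex disfluencies in the coding scheme would need the additional features and models that the abstract describes.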

Related articles

DisMo: A Morphosyntactic, Disfluency and Multi-Word Unit Annotator. An Evaluation on a Corpus of French Spontaneous and Read Speech

We present DisMo, a multi-level annotator for spoken language corpora that integrates part-of-speech tagging with basic disfluency detection and annotation, and multi-word unit recognition. DisMo is a hybrid system that uses a combination of lexical resources, rules, and statistical models based on Conditional Random Fields (CRF). In this paper, we present the first public version of DisMo for ...

A methodology for the automatic detection of perceived prominent syllables in spoken French

Prosodic transcription of spoken corpora relies mainly on the identification of perceived prominence. However, the manual annotation of prominent phenomena is extremely time-consuming, and varies greatly from one expert to another. Automating this procedure would be of great importance. In this study, we present the first results of a methodology aiming at an automatic detection of prominence sy...

Paraphrastic Reformulations in Spoken Corpora

Our work addresses the automatic detection of paraphrastic reformulation in French spoken corpora. The proposed approach is syntagmatic. It is based on specific markers and the specificities of the spoken language. Manual multi-dimensional annotation performed by two annotators provides fine-grained reference data. An automatic method is proposed in order to decide whether sentences contain or ...

Repurposing Corpora for Speech Repair Detection: Two Experiments

Unrehearsed spoken language often contains many disfluencies. If we want to correctly interpret the content of spoken language, we need to be able to detect these disfluencies and deal with them appropriately. In the work described here, we use a statistical noisy channel model to detect disfluencies in transcripts of spoken language. Like all statistical approaches, this is naturally very data...

Detection and Analysis of Paraphrastic Reformulations in Spoken Corpora (Repérage et analyse de la reformulation paraphrastique dans les corpus oraux) [in French]

Our work addresses the automatic detection of paraphrastic rephrasing in spoken corpora. The proposed approach is syntagmatic. It is based on paraphrastic rephrasing markers and the specificities of the spoken language. Manual annotation performed by two annotators provides a fine-grained and multi-dimensional description of the reference data. An automatic method is proposed in order to decide wheth...

Publication date: 2015